Multiword expression aware neural machine translation
Multiword Expressions (MWEs) are a frequently occurring phenomenon found in all natural languages that is of great importance to linguistic theory, natural language processing applications, and machine translation systems. Neural Machine Translation (NMT) architectures do not handle these expressions well, and previous studies have not explicitly addressed MWEs in this framework. In this work, we show that by using external linguistic resources and data augmentation we can improve both the translation of MWEs that occur in the source and the generation of MWEs on the target, and improve performance by up to 5.09 BLEU points on MWE test sets. We also devise an MWE score to specifically assess the quality of MWE translation, which agrees with human evaluation. We make available the MWEscore implementation, along with MWE-annotated training sets and corpus-based lists of MWEs, for reproduction and extension.
462 Machine Translation Systems for Europe
We built 462 machine translation systems for all language pairs of the Acquis Communautaire corpus. We report and analyse the performance of these systems, and compare them against pivot translation and a number of system combination methods (multi-pivot, multisource) that are possible due to the available systems.
JRC.G.2-Global security and crisis management
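As a rough illustration of the abstract above: 462 is the number of directed pairs among 22 languages (22 x 21), which is assumed here to be how the figure arises. The sketch also shows pivot translation as the composition of two systems. The toy lexicons and the names `toy_system` and `pivot_translate` are hypothetical, introduced only for illustration, and do not come from the paper.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def directed_pairs(languages: List[str]) -> List[Tuple[str, str]]:
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(languages, 2))


def toy_system(lexicon: Dict[str, str]) -> Translator:
    """Stand-in for a trained MT system: word-by-word lexicon lookup."""
    def translate(sentence: str) -> str:
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate


def pivot_translate(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Pivot translation: translate source -> pivot, then pivot -> target."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate


# Assumption: 22 languages, so 22 * 21 directed pairs.
languages = [f"lang{i:02d}" for i in range(22)]
n_systems = len(directed_pairs(languages))

# Toy German -> English -> French pivot chain (hypothetical lexicons).
de_en = toy_system({"hallo": "hello", "welt": "world"})
en_fr = toy_system({"hello": "bonjour", "world": "monde"})
de_fr_via_en = pivot_translate(de_en, en_fr)
result = de_fr_via_en("hallo welt")
```

A direct system for a pair can then be compared against the pivoted one by scoring both outputs on the same test set.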
Towards Effective Disambiguation for Machine Translation with Large Language Models
Resolving semantic ambiguity has long been recognised as a central challenge in the field of Machine Translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to handle many such cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate "ambiguous sentences" - i.e. those containing highly polysemous words and/or rare word senses. We also propose two ways to improve their disambiguation capabilities, through a) in-context learning and b) fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs to become better disambiguators during Machine Translation. We release our curated disambiguation corpora and resources at https://data.statmt.org/ambiguous-europarl
Bridging linguistic typology and multilingual machine translation with multi-view language representations
Sparse language vectors from linguistic typology databases and learned
embeddings from tasks like multilingual machine translation have been
investigated in isolation, without analysing how they could benefit from each
other's language characterisation. We propose to fuse both views using singular
vector canonical correlation analysis and study what kind of information is
induced from each source. By inferring typological features and language
phylogenies, we observe that our representations embed typology and strengthen
correlations with language relationships. We then take advantage of our
multi-view language vector space for multilingual machine translation, where we
achieve competitive overall translation accuracy in tasks that require
information about language similarities, such as language clustering and
ranking candidates for multilingual transfer. With our method, we can easily
project and assess new languages without expensive retraining of massive
multilingual or ranking models, which are major disadvantages of related
approaches.
Comment: 15 pages, 6 figures
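The fusion step named in the abstract above, singular vector canonical correlation analysis (SVCCA), can be sketched as an SVD reduction of each view followed by CCA. This is a minimal sketch under stated assumptions, not the authors' implementation: the data are synthetic, the function names are hypothetical, CCA is computed with a QR-based formulation, and averaging the paired canonical variates is one simple way to obtain a single fused vector per language.

```python
import numpy as np


def svd_reduce(X, k):
    """Center X and keep its top-k singular directions (the 'SV' step)."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]


def cca(X, Y):
    """CCA via QR: canonical correlations are the singular values of Qx^T Qy."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    U, S, Vt = np.linalg.svd(Qx.T @ Qy)
    d = min(X.shape[1], Y.shape[1])
    # Paired canonical variates for each view, plus their correlations.
    return Qx @ U[:, :d], Qy @ Vt.T[:, :d], np.clip(S[:d], 0.0, 1.0)


def svcca_fuse(view_a, view_b, k):
    """Assumed fusion rule: average the paired canonical variates."""
    A, B, corrs = cca(svd_reduce(view_a, k), svd_reduce(view_b, k))
    return (A + B) / 2.0, corrs


# Synthetic stand-ins: 50 "languages" described by two views that share
# a 3-dimensional latent structure (not real typology or NMT data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
typology = latent @ rng.normal(size=(3, 10)) + 1e-3 * rng.normal(size=(50, 10))
learned = latent @ rng.normal(size=(3, 8)) + 1e-3 * rng.normal(size=(50, 8))

fused, corrs = svcca_fuse(typology, learned, k=3)
```

Because both synthetic views are generated from the same latent factors, the canonical correlations come out close to 1. With real typological and learned vectors they would indicate how much structure the two views share.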
Few-shot learning through contextual data augmentation
Machine translation (MT) models used in industries with constantly changing
topics, such as translation or news agencies, need to adapt to new data to
maintain their performance over time. Our aim is to teach a pre-trained MT
model to translate previously unseen words accurately, based on very few
examples. We propose (i) an experimental setup allowing us to simulate novel
vocabulary appearing in human-submitted translations, and (ii) corresponding
evaluation metrics to compare our approaches. We extend a data augmentation
approach using a pre-trained language model to create training examples with
similar contexts for novel words. We compare different fine-tuning and data
augmentation approaches and show that adaptation on the scale of one to five
examples is possible. Combining data augmentation with randomly selected
training sentences leads to the highest BLEU score and accuracy improvements.
Impressively, with only 1 to 5 examples, our model reports better accuracy
scores than a reference system trained on an average of 313 parallel examples.
Comment: 14 pages including 3 of appendices
A Latent Morphology Model for Open-Vocabulary Neural Machine Translation
Translation into morphologically-rich languages challenges neural machine
translation (NMT) models with extremely sparse vocabularies where atomic
treatment of surface forms is unrealistic. This problem is typically addressed
by either pre-processing words into subword units or performing translation
directly at the level of characters. The former is based on word segmentation
algorithms optimized using corpus-level statistics with no regard to the
translation task. The latter learns directly from translation data but requires
rather deep architectures. In this paper, we propose to translate words by
modeling word formation through a hierarchical latent variable model which
mimics the process of morphological inflection. Our model generates words one
character at a time by composing two latent representations: a continuous one,
aimed at capturing the lexical semantics, and a set of (approximately) discrete
features, aimed at capturing the morphosyntactic function, which are shared
among different surface forms. Our model achieves better accuracy in
translation into three morphologically-rich languages than conventional
open-vocabulary NMT methods, while also demonstrating a better generalization
capacity under low to mid-resource settings.
Comment: Published at ICLR 202
Towards Effective Disambiguation for Machine Translation with Large Language Models
Resolving semantic ambiguity has long been recognised as a central challenge
in the field of Machine Translation. Recent work on benchmarking translation
performance on ambiguous sentences has exposed the limitations of conventional
Neural Machine Translation (NMT) systems, which fail to handle many such cases.
Large language models (LLMs) have emerged as a promising alternative,
demonstrating comparable performance to traditional NMT models while
introducing new paradigms for controlling the target outputs. In this paper, we
study the capabilities of LLMs to translate "ambiguous sentences" - i.e. those
containing highly polysemous words and/or rare word senses. We also propose two
ways to improve their disambiguation capabilities, through a) in-context
learning and b) fine-tuning on carefully curated ambiguous datasets.
Experiments show that our methods can match or outperform state-of-the-art
systems such as DeepL and NLLB in four out of five language directions. Our
research provides valuable insights into effectively adapting LLMs to become
better disambiguators during Machine Translation. We release our curated
disambiguation corpora and resources at
https://data.statmt.org/ambiguous-europarl.
Comment: WMT 202